Homework 05: Contours and Gradients

#1: We know that quadratics are useful because smooth objective functions look like quadratics if you zoom in just right. Consider the objective function for Ordinary Least Squares (OLS) regression: the sum of the squared residuals, where

    residual = (y data value) - (m*x + b)

Is this function (the sum of squared residuals) exactly quadratic as a function of m and b? Show symbolically that this objective function is or is not exactly quadratic, using symbolic arguments (algebra) rather than numeric investigation.

#2: Re-do the contour plot using the sum of the absolute values of the residuals, rather than the sum of squares. To get a pretty picture, you might need to adjust:
* the resolution of the grid we compute, or
* the number or values of the contours, or
* the range of values that we scan.

You may use the original xdata and ydata, or the xdata and ydata after subtracting the mean of each. Optionally, you could try doing both.

If you don't want to use Python or Matlab/Octave, you may use whatever other software suits you. You can do it with a 2-dimensional "What If" table in Excel--ask if you're curious. But Excel's contour graphs look rather yucky (to me). It's even possible to do it in WolframAlpha, though you'd have to enter each residual computation individually. Desmos.com is also a good option, but again you might need to enter each residual computation individually.

Hint: abs(stuff) computes the absolute value of each element in a vector or matrix named "stuff", giving you back a vector or matrix of the same size. For example, in Matlab,
    abs([3 4 -5 0 -10]) = [3 4 5 0 10]
or in Python, use np.abs(whatever).

Copy/paste the contour plots into a Word file or the equivalent. Comment on how much the contours are like ellipses. (A Python sketch covering both #2 and #3 appears after problem #4.)

#3: Re-do the contour plot using the maximum absolute residual rather than the sum of squares or absolute values. Hint: use max() instead of sum(). If you're using Python, you might need np.max instead of plain max. Again, you might need to adjust the resolution of the grid or the contours you choose.

FYI, minimizing the maximum absolute residual is called "Chebyshev" regression. It is usually a bad idea--it is very susceptible to outliers.

Copy/paste the contour plots into a Word or Excel file or the equivalent. Comment on how much the contours are like ellipses.

#4: Gradients of Ordinary Least Squares. Consider the objective function:
    sum over i of (y_i - (m*x_i + b))^2
a) Compute its gradient symbolically, by hand.
b) Change the appropriate function (Matlab or Python) to include a computation of the gradient in the Ordinary Least Squares case. In Matlab, it will start like this:
    function [objfuncval,gr] = objfunc(myslope,myintercept,datamat)
and then the last line will be something like
    gr = (whatever formula will compute the gradient)
If you want to compute the gradient component-by-component, that is, one formula for the partial with respect to myslope and another for the partial with respect to myintercept, you can do
    gr(1) = (formula for partial w.r.t. slope)
    gr(2) = (formula for partial w.r.t. intercept)
Note that you don't have to actually run it in Matlab/Octave if you're struggling with them. Just type formulas that they would be able to interpret. Similarly with Python: if you're having trouble getting it running on your machine, you can still get full credit on this problem by just doing the coding correctly.

Copy/paste the resulting code into your homework file.
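For #4(b), here is one possible Python version of the function, a minimal sketch rather than an official solution. The names objfunc, myslope, myintercept, and datamat follow the Matlab skeleton above; the assumption that datamat holds the x data in its first column and the y data in its second is mine, so adjust it to match your own code, and check the gradient formulas against your hand derivation in part (a).

    import numpy as np

    def objfunc(myslope, myintercept, datamat):
        # Assumption: column 0 of datamat holds xdata, column 1 holds ydata.
        x = datamat[:, 0]
        y = datamat[:, 1]
        resid = y - (myslope * x + myintercept)   # vector of residuals
        objfuncval = np.sum(resid ** 2)           # sum of squared residuals
        gr = np.zeros(2)
        gr[0] = np.sum(-2.0 * x * resid)          # partial w.r.t. myslope
        gr[1] = np.sum(-2.0 * resid)              # partial w.r.t. myintercept
        return objfuncval, gr

    # Example call with made-up data:
    # datamat = np.column_stack(([1.0, 2.0, 3.0], [2.0, 4.0, 7.0]))
    # val, gr = objfunc(1.5, 0.0, datamat)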
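For #2 and #3, here is a minimal Python/Matplotlib sketch of one way to build the contour plots. The xdata and ydata values below are placeholders (the real ones come from the original contour-plot exercise), and the grid ranges, resolution, and number of contour levels are guesses that you will probably need to adjust, as the problems say.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder data -- substitute the xdata and ydata from the original exercise.
    xdata = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    ydata = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    # Grid of candidate slopes (m) and intercepts (b); adjust ranges/resolution as needed.
    mvals = np.linspace(-2, 4, 201)
    bvals = np.linspace(-5, 5, 201)
    M, B = np.meshgrid(mvals, bvals)

    sum_abs = np.zeros_like(M)   # objective for #2: sum of absolute residuals
    max_abs = np.zeros_like(M)   # objective for #3: maximum absolute residual
    for x, y in zip(xdata, ydata):
        resid = np.abs(y - (M * x + B))
        sum_abs += resid
        max_abs = np.maximum(max_abs, resid)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].contour(M, B, sum_abs, levels=20)
    axes[0].set_title('Sum of absolute residuals (#2)')
    axes[1].contour(M, B, max_abs, levels=20)
    axes[1].set_title('Maximum absolute residual (#3)')
    for ax in axes:
        ax.set_xlabel('slope m')
        ax.set_ylabel('intercept b')
    plt.tight_layout()
    plt.show()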
#5 (from Winston, pg. 628): The energy used in compressing a gas in 3 stages from an initial pressure I to a final pressure F is given by the formula

    K*(sqrt(p1/I) + sqrt(p2/p1) + sqrt(F/p2) - 3)

where K is some arbitrary positive constant. Let's use K=7. Formulate an NLP whose solution describes how to minimize the energy used in compressing the gas.

a) Using the values I=1 (atmosphere) and F=10, solve the NLP numerically (using Excel Solver or equivalent); attach your Solver file along with your Word file when you submit the homework.
b) Based on the optimal p1 and p2 values you got, write down a formula for the optimal p1 and p2 as a function of I and F. This involves some investigation and guessing: try taking a look at p1/I, p2/p1, and F/p2.
c) Prove that your conjecture is indeed optimal (take the gradient, etc.). Note: I am asking you to make a guess, then prove the guess is optimal by showing it satisfies the appropriate equation. You could go the other way: take the gradient, and try to solve the equation gradient = zero vector. I think my way is easier.
d) Find the Hessian at the optimal point in part (a); your answer should be a matrix with numbers in it, rather than formulas. You could do this by finite differences, or by taking the Hessian symbolically and then plugging in values. Reminder: while some textbooks define the Hessian as the determinant of the matrix of 2nd derivatives, in optimization class the Hessian is that matrix itself, not just its determinant or one element.

(Python sketches for parts (a) and (d) appear at the end of this handout.)

#6 (optional): Re-do #2 and #3 using a different set of (x,y) data that do not fall perfectly on a straight line. For example,
    x  y
    1  1
    1  3
    5  5
    5  7
Do there seem to be multiple optima? You might also try doing a contour plot of the original least-squares objective function.

#7 (optional): a Control Theory problem. Implement Example 4 (Control) in Chapter 7.3 in the Luenberger textbook in Excel Solver or equivalent. Also try putting different weights on the x variables vs. the u variables. I'm also wondering: is the optimal solution a proportional control? That is, is u = (some constant)*x at each step? Or is there some other simple rule like that?

#8 (optional): an Approximation problem. Problem 1 in section 7.10 in the Luenberger textbook: how to approximate a function with a polynomial.

-----------------------------------------------------

Here are some of the Desmos graphs that I showed in class:
Choosing Delta-x in Finite Differences: https://www.desmos.com/calculator/2uve4ik24x
Newton Root, in 1 dimension: https://www.desmos.com/calculator/owomfhwz3o
Newton Optimizing, in 1 dimension: https://www.desmos.com/calculator/2nlhb7yjf7 or another version at https://www.desmos.com/calculator/mmcbljru5l
Gradient and Taylor Series Contour/Newton Direction in 2 dimensions: https://www.desmos.com/calculator/mgkpzuybef

------------------------------------------------------

Here is a very interesting presentation that talks about optimization and machine learning: http://pages.cs.wisc.edu/~swright/wright-kdd-2013.pdf
Also interesting: http://mrtz.org/blog/the-zen-of-gradient-descent/ and http://sebastianruder.com/optimizing-gradient-descent/ ("An overview of gradient descent optimization algorithms").

Also, below I'm including the list of ways that Matlab decides when to stop when optimizing.

% Here are the ways Matlab decides to stop:
%  1  Magnitude of gradient smaller than the specified tolerance.
%  2  Change in X smaller than the specified tolerance.
%  3  Change in the objective function value smaller than the specified tolerance (only occurs in the large-scale method).
%  0  Maximum number of function evaluations or iterations reached.
% -1  Algorithm terminated by the output function.
% -2  Line search cannot find an acceptable point along the current search direction (only occurs in the medium-scale method).
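For #5(a), the problem asks for Excel Solver or equivalent; if your "equivalent" is Python, here is a minimal sketch using scipy.optimize.minimize as a stand-in for Solver. The starting point and the bounds are my own choices, picked only to keep p1 and p2 between I and F so the square roots stay well-defined.

    import numpy as np
    from scipy.optimize import minimize

    K, I, F = 7.0, 1.0, 10.0   # K is arbitrary and positive; I and F are from part (a)

    def energy(p):
        p1, p2 = p
        return K * (np.sqrt(p1 / I) + np.sqrt(p2 / p1) + np.sqrt(F / p2) - 3.0)

    # Start somewhere between I and F; bounds keep the pressures in [I, F].
    result = minimize(energy, x0=[3.0, 6.0], bounds=[(I, F), (I, F)])
    print('optimal p1, p2:', result.x)
    print('minimum energy:', result.fun)
    # These ratios are worth inspecting for part (b), as the hint suggests:
    print('p1/I, p2/p1, F/p2:', result.x[0] / I, result.x[1] / result.x[0], F / result.x[1])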
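For #5(d), the finite-difference route can be done in a few lines. Below is a minimal, self-contained sketch of a central-difference Hessian; the step size h and the placeholder point p_star are assumptions, and p_star should be replaced by the optimal (p1, p2) you actually found in part (a).

    import numpy as np

    K, I, F = 7.0, 1.0, 10.0

    def energy(p):
        p1, p2 = p
        return K * (np.sqrt(p1 / I) + np.sqrt(p2 / p1) + np.sqrt(F / p2) - 3.0)

    def finite_difference_hessian(f, p, h=1e-4):
        # Approximate the matrix of second derivatives of f at p by central differences.
        p = np.asarray(p, dtype=float)
        n = len(p)
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei = h * np.eye(n)[i]   # step of size h in coordinate i
                ej = h * np.eye(n)[j]   # step of size h in coordinate j
                H[i, j] = (f(p + ei + ej) - f(p + ei - ej)
                           - f(p - ei + ej) + f(p - ei - ej)) / (4.0 * h * h)
        return H

    # Placeholder point -- replace with the optimal (p1, p2) from part (a).
    p_star = np.array([3.0, 6.0])
    print(finite_difference_hessian(energy, p_star))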